As shown in Figure 5.27, the heatmap clusters genes and samples based on Euclidean dis-
tance between the expression values. As expected, samples from the same group are clus-
tered together.
5.3.7.9 Ontology and Pathways
After identifying the differentially expressed genes, the next step is to study the func-
tions of these genes, their pathways, and the conditions associated with them based on
the accumulated knowledge that we already have from previous studies and discoveries.
This knowledge is available in databases such as GO [37] and KEGG [38], as well as
other pathway databases. Many of the genes associated with given biological processes are
differentially expressed in a given condition, such as a disease. By performing GO analysis, we
will be able to identify those biological processes, cellular locations, and molecular func-
tions that are impacted by the condition studied. GO attempts to capture three aspects of
a gene: (i) Biological processes (BP) that the gene may be involved in, (ii) Molecular functions
(MF), and (iii) Cellular components (CC), where the biological processes and molecular
activities take place in the cell. It is important to know that GO terms aim to describe the
normal functions, processes, or locations that gene products are involved in. They do not
capture pathological processes, experimental conditions, or temporal information. Given a
set of differentially expressed genes (upregulated or downregulated), GO and KEGG
analyses identify the GO terms and pathways, respectively, that are over-represented among
those genes. edgeR provides the “goana” function for GO analysis and the “kegga” function for
KEGG analysis. Both functions accept a DGELRT object and use Entrez Gene identifiers (IDs)
to annotate the genes. The NCBI Entrez Gene IDs must be present as row names, as we did
above. It is also important to specify the species studied. The following script performs GO
analysis and annotates the significantly differentially expressed genes (downregulated and
upregulated) with GO terms:
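A minimal sketch of such a script is given here. It assumes that the edgeR and org.Hs.eg.db packages are installed and that the species is human. To stay self-contained it uses simulated counts rather than the data analyzed above, so the object names (counts, group, lrt, go), the sample labels, and the eightfold change are illustrative assumptions only.

```r
library(edgeR)          # also attaches limma, which defines goana() and topGO()
library(org.Hs.eg.db)   # human annotation used by goana(species = "Hs")

set.seed(1)

# Simulated data (assumption): 2000 genes x 6 samples, with real Entrez Gene IDs
# as row names, because goana() reads the gene IDs from the row names.
entrez <- head(keys(org.Hs.eg.db, keytype = "ENTREZID"), 2000)
counts <- matrix(rnbinom(2000 * 6, mu = 100, size = 10), nrow = 2000,
                 dimnames = list(entrez, paste0("S", 1:6)))
counts[1:100, 4:6] <- counts[1:100, 4:6] * 8L   # make 100 genes upregulated

group <- factor(rep(c("control", "treated"), each = 3))

# Standard edgeR GLM workflow ending in a DGELRT object
y <- DGEList(counts = counts, group = group)
y <- calcNormFactors(y)
design <- model.matrix(~ group)
y <- estimateDisp(y, design)
fit <- glmFit(y, design)
lrt <- glmLRT(fit, coef = 2)    # treated vs. control

# GO analysis: genes with FDR < 0.05 (the default) are split into up and down
# sets, and each GO term is tested for over-representation in each set.
go <- goana(lrt, species = "Hs")

# Show the top 10 biological process (BP) terms
topGO(go, ontology = "BP", number = 10)
```

The returned data frame has one row per GO term, with the term name, its ontology (BP, CC, or MF), the number of genes annotated to it, the numbers of upregulated and downregulated genes among them, and the corresponding p-values. The KEGG counterpart is called in the same way, kegga(lrt, species = "Hs"), but it retrieves pathway annotation from the KEGG website and therefore needs network access.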
FIGURE 5.28 Ontology annotation of the significantly expressed genes.